An Ensemble Clustering for Mining High-dimensional Biological Big Data

Authors

  • DEWAN MD. FARID
  • ANN NOWE
  • BERNARD MANDERICK
Abstract

Clustering high-dimensional biological big data is an extremely challenging task, as the data space is often too large and too noisy. Conventional clustering methods can be inefficient and ineffective on such data, because traditional distance measures may be dominated by noise in many of the dimensions. An additional challenge in biological big data is that we need to find not only the clusters of instances (genes), but also, for each cluster, a set of features (conditions) that manifest the cluster. In this paper, we propose an ensemble clustering approach with feature selection and grouping for clustering high-dimensional biological big data. It uses two well-established clustering methods: (a) k-means clustering and (b) similarity-based clustering. The approach selects the most relevant features in the dataset and groups them into subsets of features to overcome the problems associated with traditional clustering methods. We also apply biclustering to each cluster generated by the ensemble clustering, using mean squared residue scores to find sub-matrices in the biological data. We have applied the proposed clustering method to unlabeled genomic data (148 exome datasets) of Brugada syndrome to discover previously unknown data patterns. Experiments indicate that the proposed clustering method achieves high-performance clustering results on high-dimensional biological big data.
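
The abstract states that biclusters are scored by the mean squared residue. As a minimal sketch, assuming the standard Cheng and Church definition of that score (the paper's own implementation is not reproduced here), the residue of a sub-matrix given by a set of rows (genes) and columns (conditions) could be computed as below. The function name `mean_squared_residue` and the toy matrices are illustrative assumptions, not taken from the paper.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the sub-matrix matrix[rows][cols].

    Assumes the standard Cheng-Church definition: the residue of a cell is
    value - row_mean - col_mean + overall_mean, all computed within the
    sub-matrix. A score of 0 means the sub-matrix is perfectly additive.
    """
    if not rows or not cols:
        raise ValueError("rows and cols must be non-empty")

    # Extract the sub-matrix selected by the row and column indices.
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)

    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall_mean = sum(row_means) / n_rows

    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall_mean
            total += residue * residue
    return total / (n_rows * n_cols)


# Hypothetical toy data: each row is a shifted copy of the first row,
# so the rows behave coherently across all conditions.
coherent = [[1, 2, 3],
            [2, 3, 4],
            [4, 5, 6]]
# Hypothetical toy data with no additive structure.
incoherent = [[1, 0],
              [0, 1]]

coherent_score = mean_squared_residue(coherent, [0, 1, 2], [0, 1, 2])
incoherent_score = mean_squared_residue(incoherent, [0, 1], [0, 1])
print(coherent_score, incoherent_score)
```

A lower score indicates that the selected genes vary more coherently across the selected conditions, which is why a biclustering step would prefer sub-matrices with a small residue.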

Similar resources

High-Dimensional Unsupervised Active Learning Method

In this work, a hierarchical ensemble of projected clustering algorithm for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM) which is a fuzzy learning scheme, inspired by some behavioral features of human brain functionality. High-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...

A new ensemble clustering method based on fuzzy c-means clustering while maintaining diversity in ensemble

An ensemble clustering has been considered as one of the research approaches in data mining, pattern recognition, machine learning and artificial intelligence over the last decade. In clustering, the combination first produces several bases clustering, and then, for their aggregation, a function is used to create a final cluster that is as similar as possible to all the cluster bundles. The inp...

A Novel Ensemble Approach for Anomaly Detection in Wireless Sensor Networks Using Time-overlapped Sliding Windows

One of the most important issues concerning the sensor data in the Wireless Sensor Networks (WSNs) is the unexpected data which are acquired from the sensors. Today, there are numerous approaches for detecting anomalies in the WSNs, most of which are based on machine learning methods. In this research, we present a heuristic method based on the concept of “ensemble of classifiers” of data minin...

Combining Classifier Guided by Semi-Supervision

The article suggests an algorithm for regular classifier ensemble methodology. The proposed methodology is based on possibilistic aggregation to classify samples. The argued method optimizes an objective function that combines environment recognition, multi-criteria aggregation term and a learning term. The optimization aims at learning backgrounds as solid clusters in subspaces of the high...

Entropy-based Consensus for Distributed Data Clustering

The increasingly larger scale of available data and the more restrictive concerns on their privacy are some of the challenging aspects of data mining today. In this paper, Entropy-based Consensus on Cluster Centers (EC3) is introduced for clustering in distributed systems with a consideration for confidentiality of data; i.e. it is the negotiations among local cluster centers that are used in t...

Publication date: 2016